The purpose of this notebook is to demonstrate how pyLogit can replicate multinomial logit (MNL) models estimated with R's mlogit package, using the Train and Fishing datasets.
The models estimated in this notebook will be as follows:
ml.Train <- mlogit(choice ~ price + time + change + comfort | -1, Tr)
ml.Fish <- mlogit(mode ~ price | income | catch, Fishing, shape = "wide", varying = 2:9)
In [1]:
from collections import OrderedDict # For recording the model specification
import pandas as pd # For file input/output
import numpy as np # For vectorized math operations
import pylogit as pl # For MNL model estimation and
# conversion from wide to long format
In [2]:
# Load the Train data, noting that the data is in wide data format
wide_train_df = pd.read_csv("../data/train_data_r.csv")
# Load the Fishing data, noting that the data is in wide data format
wide_fishing_df = pd.read_csv("../data/fishing_data_r.csv")
In [3]:
# Look at the raw Train data
wide_train_df.head().T
Out[3]:
In [4]:
# Look at the raw Fishing data
wide_fishing_df.head().T
Out[4]:
Because the Train dataset is a panel dataset, and because our estimated MNL model will not account for the panel nature of the data, we need a new id column that identifies each individual choice situation.
Similarly, the Fishing data needs an observation id column, even though it is not a panel dataset. Every dataset used with pyLogit needs an "observation" id column that identifies the unit of observation being modeled.
In [5]:
# Note that we start the ids for the choice situations at 1.
wide_train_df["choice_situation"] = wide_train_df.index.values + 1
wide_fishing_df["observation"] = wide_fishing_df.index.values + 1
Since the columns denoting the choice in both the Train and the Fishing data are string objects, we need to convert them into integer-based columns.
In [6]:
# Convert the choice column for the Train data into integers
# Note that we will use a 1 to denote 'choice1' and a 2 to
# represent 'choice2'
wide_train_df["choice"] = wide_train_df["choice"].map({'choice1': 1,
                                                       'choice2': 2})
# Convert the "mode" column for the Fishing data into an
# integer based column. Use the following mapping:
mode_name_to_id = dict(zip(["beach", "pier", "boat", "charter"],
range(1, 5)))
wide_fishing_df["mode"] = wide_fishing_df["mode"].map(mode_name_to_id)
For both the Train and the Fishing data, all of the alternatives are available in every choice situation. Note that, in general, this is not the case for choice data. As such, we need columns that denote the availability of each alternative for each individual.
These columns will be filled entirely with ones in the wide format dataframes, because every alternative is always available to every individual.
In [7]:
# Create the needed availability columns for the Train data
# where each choice is a binary decision
for i in [1, 2]:
    wide_train_df["availability_{}".format(i)] = 1
# Create the needed availability columns for the Fishing data
# where each choice has four available alternatives
for i in range(1, 5):
    wide_fishing_df["availability_{}".format(i)] = 1
In [8]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_train_df.columns
Out[8]:
In [9]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format
# data frame
train_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
train_obs_id_col = "choice_situation"
# Determine what column denotes the choice that was made
train_choice_column = "choice"
# Create the list of observation specific variables
train_ind_variables = ["id", "choiceid"]
# Specify the variables that vary across individuals and some or all alternatives
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
train_alt_varying_variables = {"price": {1: "price1",
2: "price2"},
"time": {1: "time1",
2: "time2"},
"change": {1: "change1",
2: "change2"},
"comfort": {1: "comfort1",
2: "comfort2"}
}
# Specify the availability variables
train_availability_variables = OrderedDict()
for alt_id, var in zip([1, 2], ["availability_1", "availability_2"]):
    train_availability_variables[alt_id] = var
In [10]:
##########
# Actually perform the conversion to long format
##########
long_train_df = pl.convert_wide_to_long(wide_data=wide_train_df,
ind_vars=train_ind_variables,
alt_specific_vars=train_alt_varying_variables,
availability_vars=train_availability_variables,
obs_id_col=train_obs_id_col,
choice_col=train_choice_column,
new_alt_id_name=train_alt_id)
# Look at the long format train data
long_train_df.head()
Out[10]:
In [11]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_fishing_df.columns
Out[11]:
In [12]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format
# data frame
fishing_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
fishing_obs_id_col = "observation"
# Determine what column denotes the choice that was made
fishing_choice_column = "mode"
# Create the list of observation specific variables
fishing_ind_variables = ["income"]
# Specify the variables that vary across individuals and some or all alternatives
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
fishing_alt_varying_variables = {"price": {1: "price.beach",
2: "price.pier",
3: "price.boat",
4: "price.charter"},
"catch": {1: "catch.beach",
2: "catch.pier",
3: "catch.boat",
4: "catch.charter"},
}
# Specify the availability variables
fishing_availability_variables = OrderedDict()
for alt_id, var in zip(range(1, 5), ["availability_{}".format(x) for x in range(1, 5)]):
    fishing_availability_variables[alt_id] = var
In [13]:
##########
# Actually perform the conversion to long format
##########
long_fishing_df = pl.convert_wide_to_long(wide_data=wide_fishing_df,
ind_vars=fishing_ind_variables,
alt_specific_vars=fishing_alt_varying_variables,
availability_vars=fishing_availability_variables,
obs_id_col=fishing_obs_id_col,
choice_col=fishing_choice_column,
new_alt_id_name=fishing_alt_id)
# Look at the long format fishing data
long_fishing_df.head()
Out[13]:
In [14]:
# For the Train data, scale the price and time variables so the units
# are meaningful, namely hours and euros.
long_train_df["price_euros"] = long_train_df["price"] / 100.0 * 2.20371
long_train_df["time_hours"] = long_train_df["time"] / 60.0
For reasons of numerical stability, it is advisable to scale one's variables so that the estimated coefficients are similar in absolute magnitude and, where possible, close to 1 in absolute value (in other words, neither terribly tiny nor extremely large). This is done for the Fishing data below.
In [15]:
# Scale the income data
long_fishing_df["income_thousandth"] = long_fishing_df["income"] / 1000.0
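Such rescaling is harmless because, in a linear-in-parameters utility, dividing a variable by a constant simply multiplies its estimated coefficient by that constant, leaving every systematic utility contribution unchanged. A minimal sketch with made-up numbers (the coefficient and income value below are purely illustrative):

```python
# Hypothetical coefficient on raw income and a hypothetical income value.
beta_income = -1.355e-4
income = 42000.0

# After dividing income by 1000, the coefficient absorbs the factor of 1000 ...
beta_scaled = beta_income * 1000.0

# ... so the contribution to the systematic utility is identical.
raw_contribution = beta_income * income
scaled_contribution = beta_scaled * (income / 1000.0)
print(abs(raw_contribution - scaled_contribution) < 1e-12)
```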
Note that the Train dataset is a stated-preference dataset with unlabeled alternatives. Because the unobserved factors affecting a person's choice are assumed to be the same for both alternatives, the error terms are expected to have the same mean for both alternatives. The two alternative-specific constants (ASCs) would therefore be equal, and their difference, which is the only quantity that can be identified, would be zero. Because of this, no ASCs are estimated for the Train model.
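The identification argument above can be illustrated numerically: in an MNL model, adding the same constant to every alternative's utility leaves the choice probabilities unchanged, so only differences between ASCs are identified, and a zero difference means the ASCs can be dropped entirely. A small sketch with made-up utility values:

```python
import math

def mnl_probs(utilities):
    """Softmax choice probabilities for an MNL model."""
    exps = [math.exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical systematic utilities for the two train alternatives.
base = mnl_probs([1.2, 0.7])
# Adding the same ASC (here 5.0) to both alternatives ...
shifted = mnl_probs([1.2 + 5.0, 0.7 + 5.0])
# ... leaves the choice probabilities unchanged.
print(all(abs(p - q) < 1e-12 for p, q in zip(base, shifted)))
```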
In [16]:
# Look at the columns available for use in specifying the model
long_train_df.columns
Out[16]:
In [17]:
# Create the model specification
train_spec = OrderedDict()
train_names = OrderedDict()
# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe and the values should be a list with a combination
# of alternative id's and/or lists of alternative id's. There
# should be one element for each beta that will be estimated
# in relation to the given column. Lists of alternative id's
# mean that all of the alternatives in the list will get a
# single beta for them, for the given variable.
# The names dictionary should contain one name for each
# element (that is each alternative id or list of alternative
# ids) in the specification dictionary value for the same
# variable
for col, display_name in [("price_euros", "price"),
("time_hours", "time"),
("change", "change"),
("comfort", "comfort")]:
    train_spec[col] = [[1, 2]]
    train_names[col] = [display_name]
In [18]:
# Create an instance of the MNL model class
train_model = pl.create_choice_model(data=long_train_df,
alt_id_col=train_alt_id,
obs_id_col=train_obs_id_col,
choice_col=train_choice_column,
specification=train_spec,
model_type="MNL",
names=train_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
train_model.fit_mle(np.zeros(4))
# Look at the estimation summaries
train_model.get_statsmodels_summary()
Out[18]:
Look at the corresponding results from mlogit:
Call:
mlogit(formula = choice ~ price + time + change + comfort | -1, data = Tr,
    method = "nr", print.level = 0)

Frequencies of alternatives:
      1       2
0.50324 0.49676

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.00014
successive function values within tolerance limits

Coefficients :
          Estimate Std. Error  t-value  Pr(>|t|)
price   -0.0673580  0.0033933 -19.8506 < 2.2e-16 ***
time    -1.7205514  0.1603517 -10.7299 < 2.2e-16 ***
change  -0.3263409  0.0594892  -5.4857 4.118e-08 ***
comfort -0.9457256  0.0649455 -14.5618 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1724.2
In terms of differences between the estimation output of mlogit and my estimation output, the differences seem to lie mainly in the p-values. My p-values are calculated with respect to an asymptotic normal distribution, whereas the p-values of mlogit are based on the t-distribution; this accounts for the p-value difference.
There is a very slight difference between mlogit's value for the time parameter and my own time parameter estimate, but this may simply be due to the convergence criteria that each of the packages uses.
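The normal-versus-t distinction can be checked by hand. The sketch below recomputes the two-sided, normal-based p-value for the price row using only the Python standard library, via the identity P(|Z| > |z|) = erfc(|z| / sqrt(2)); the estimate and standard error are copied from the summaries above. (A t-based p-value like mlogit's would instead use the t-distribution's survival function, e.g. scipy.stats.t.sf, with the model's residual degrees of freedom.)

```python
import math

# Estimate and standard error for "price", copied from the summaries above.
est, se = -0.0673580, 0.0033933

# The z-statistic; this matches mlogit's reported t-value up to rounding.
z = est / se

# Two-sided p-value under the asymptotic normal distribution:
# P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_normal = math.erfc(abs(z) / math.sqrt(2))

print("z = {:.4f}, p = {:.3e}".format(z, p_normal))
```

With a statistic this large the p-value is vanishingly small under either distribution, which is why the reported significance conclusions agree even though the exact p-values differ.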
In [19]:
# Look at the columns available for use in specifying the model
long_fishing_df.columns
Out[19]:
In [20]:
# Create the model specification
fishing_spec = OrderedDict()
fishing_names = OrderedDict()
# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe and the values should be a list with a combination
# of alternative id's and/or lists of alternative id's. There
# should be one element for each beta that will be estimated
# in relation to the given column. Lists of alternative id's
# mean that all of the alternatives in the list will get a
# single beta for them, for the given variable.
# The names dictionary should contain one name for each
# element (that is each alternative id or list of alternative
# ids) in the specification dictionary value for the same
# variable
# Note the intercept for beach is constrained to zero for identification
fishing_spec["intercept"] = list(range(2, 5))
fishing_names["intercept"] = ["ASC: pier",
"ASC: boat",
"ASC: charter"]
fishing_spec["price"] = [[1, 2, 3, 4]]
fishing_names["price"] = ["price"]
# Note the income coefficient for beach is constrained to zero for identification
# Note also that we use the scaled variables because they numerically perform better
# fishing_spec["income"] = range(2, 5)
# fishing_names["income"] = ["income_{}".format(x)
# for x in ["pier", "boat", "charter"]]
fishing_spec["income_thousandth"] = list(range(2, 5))
fishing_names["income_thousandth"] = ["income_{} / 1000".format(x)
                                      for x in ["pier", "boat", "charter"]]
fishing_spec["catch"] = list(range(1, 5))
fishing_names["catch"] = ["catch_{}".format(x) for x in
                          ["beach", "pier", "boat", "charter"]]
In [21]:
# Create an instance of the MNL model class
fishing_model = pl.create_choice_model(data=long_fishing_df,
alt_id_col=fishing_alt_id,
obs_id_col=fishing_obs_id_col,
choice_col=fishing_choice_column,
specification=fishing_spec,
model_type="MNL",
names=fishing_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
fishing_model.fit_mle(np.zeros(11))
# Look at the estimation summaries
fishing_model.get_statsmodels_summary()
Out[21]:
Look at the results from mlogit:
Call:
mlogit(formula = mode ~ price | income | catch, data = Fishing, shape = "wide",
    varying = 2:9, method = "nr", print.level = 0)

Frequencies of alternatives:
  beach    boat charter    pier
0.11337 0.35364 0.38240 0.15059

nr method
7 iterations, 0h:0m:0s
g'(-H)^-1g = 2.54E-05
successive function values within tolerance limits

Coefficients :
                       Estimate  Std. Error  t-value  Pr(>|t|)
boat:(intercept)     8.4184e-01  2.9996e-01   2.8065 0.0050080 **
charter:(intercept)  2.1549e+00  2.9746e-01   7.2443 4.348e-13 ***
pier:(intercept)     1.0430e+00  2.9535e-01   3.5315 0.0004132 ***
price               -2.5281e-02  1.7551e-03 -14.4046 < 2.2e-16 ***
boat:income          5.5428e-05  5.2130e-05   1.0633 0.2876612
charter:income      -7.2337e-05  5.2557e-05  -1.3764 0.1687088
pier:income         -1.3550e-04  5.1172e-05  -2.6480 0.0080977 **
beach:catch          3.1177e+00  7.1305e-01   4.3724 1.229e-05 ***
boat:catch           2.5425e+00  5.2274e-01   4.8638 1.152e-06 ***
charter:catch        7.5949e-01  1.5420e-01   4.9254 8.417e-07 ***
pier:catch           2.8512e+00  7.7464e-01   3.6807 0.0002326 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1199.1
McFadden R^2: 0.19936
Likelihood ratio test : chisq = 597.16 (p.value = < 2.22e-16)
The coefficient estimates, standard errors, and t-values are exactly equal. The p-value differences arise, as already noted, because mlogit uses the t-distribution whereas pyLogit uses an asymptotic normal distribution. The log-likelihood of our final model is also the same.
Note that the McFadden R^2 values are different. I am not sure how the mlogit value is calculated. From "Coefficients of Determination for Multiple Logistic Regression Analysis" by Scott Menard (2000), The American Statistician, 54:1, 17-24, we have the following equation:

R^2_McFadden = 1 - (LL_model / LL_null)
In [22]:
##########
# Make sure that pyLogit's null log-likelihood
# and McFadden's R^2 are correct
##########
# Note that every observation in the Fishing dataset
# has 4 available alternatives, therefore the null
# probability is 0.25
null_prob = 0.25
# Calculate how many observations are in the Fishing
# dataset
num_fishing_obs = wide_fishing_df.shape[0]
# Calculate the Fishing dataset's null log-likelihood
null_fishing_log_likelihood = (num_fishing_obs *
np.log(null_prob))
# Determine whether pyLogit's null log-likelihood is correct
correct_null_ll = (null_fishing_log_likelihood ==
fishing_model.null_log_likelihood)
print("pyLogit's null log-likelihood is correct:", correct_null_ll)
# Calculate McFadden's R^2
mcfaddens_r2 = 1 - (fishing_model.log_likelihood / fishing_model.null_log_likelihood)
print("McFadden's R^2 is {:.5f}".format(mcfaddens_r2))